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The Minkowski content Lo(G) of a body G C R'' represents the 
boundary length (for d = 2) or the surface area (for d = 3) of G. A 
method for estimating Lo(G) is proposed. It relies on a nonparamet- 
ric estimator based on the information provided by a random sample 
(taken on a rectangle containing G) in which we are able to identify 
whether every point is inside or outside G. Some theoretical prop- 
erties concerning strong consistency, Li-error and convergence rates 
are obtained. A practical application to a problem of image analysis 
in cardiology is discussed in some detail. A brief simulation study is 
provided. 

1. Introduction. The estimation of the surface area of a body G in the 
Euclidean space M'^ ("surface area" amounts to "boundary length" in the 
bidimensional case d = 2) has been extensively considered in stereology; 
see [1, 2, 12]. We are concerned here with this problem from a different 
point of view, using the approach and tools of nonparametric statistics and, 
more specifically, of nonparametric set estimation; see, for example, [6] for 
a survey. 

In a way, the length and surface area estimation problem can be seen 
as a further, more difficult, stage in set estimation theory, after the early 
developments concerned with the estimation of volume (associated with the 
Li (measure) distance; see [8]), "visual" shape (associated with the Haus- 
dorff metric; see [5]), level sets [3, 13, 16, 19, 20] and boundaries [7]. We will 
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see, in fact, that, while the sample data in nonparametric set estimation 
theory comes usually from random points selected inside the set of interest, 
G, we will need here additional information given by sample points coming 
from outside G (see the beginning of Section 2). The estimation of bound- 
ary length has also some practical interest. For example, in medical imaging 
the boundary length appears in connection with the notion of "Contour In- 
dex" (see, e.g., [11]), a shape measurement used as an auxiliary diagnostic 
criterion. These ideas are developed in more detail in Section 4. 

At this point it might be useful to point out what we mean by "nonpara- 
metric approach" in order to clarify its main differences with the stereological 
point of view for these problems: 

(a) Unlike the stereological approach, we are not concerned with unbi- 
ased estimation, but with asymptotic properties such as consistency and 
convergence rates. 

(b) The proposed estimator is intended to work asymptotically in any di- 
mension d under quite general shape restrictions. It depends on a smoothing 
parameter which must be carefully chosen. 

(c) Our method will provide as a by-product an estimator of the boundary 
of the body G under study. In contrast, stereological methods are not usually 
concerned with the global estimation of sets; they are rather focused on the 
estimation of some real parameter (length, volume, surface area, . . . ). 

(d) The sample data consists of randomly selected points. In stereology 
the available information for estimating lengths and surface areas usually 
comes either from one- or two-dimensional sections or from systematic grids. 

Our aim is to obtain an easy-to-implement automatic method valid for 
the analysis of a wide class of images. As a first step we should clearly es- 
tablish what we mean by "surface area." The Hausdorff measure (see, e.g., 
[14]) provides a suitable general definition of this concept. This definition, 
however, is not always very convenient from the point of view of mathemat- 
ical handling and effective evaluation. So we will use instead the following 
simpler, less general notion (which coincides with the Hausdorff measure, up 
to a constant factor, in regular cases): The surface area of a body G C M'^ is 
given by the Minkowski content (see [14], Chapter 2), 



provided that this limit exists and it is finite. Here stands for the ordinary 
Lebesgue measure on W^, dG denotes the boundary of G and, for any A C M'^, 
B{A,e) is the "outer parallel set" B{A,e) := IJ^g^ i?(x, e), where i?(x,e) 
denotes the closed ball with center x and radius e. While the Minkowski 
content fails to satisfy some interesting properties, such as u-additivity, it 
has a clear intuitive basis and is sufficient for most practical purposes. 




Lo(G') = lim 



k^{B{dG,e)) 
2e 
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This paper is organized as follows. The estimator is introduced in Sec- 
tion 2. Its basic statistical properties concerning asymptotic behavior, bias 
and variability are established in Section 3. A real-data application in car- 
diology is discussed in Section 4. A brief Monte Carlo study is presented in 
Section 5. Section 6 is devoted to the proofs. 

2. The sampling model and the proposed estimator. Let G C M'^ be a 

body whose Minkowski content Lq = Lq{G) is well defined, strictly positive 
and finite. Our goal is estimating Lq which for d = 2 represents the boundary 
length and for d = 2> the surface area. Without loss of generality, we will 
assume that G is a subset of the open unit square (0, 1)'^. 

The sampling information is given by i.i.d. observations (Zi, 5i ),..., 
of a random variable (2^,5), where Z is uniformly distributed on the unit 
square [0, l]'^ and 5 = 1 if Z G G, 5 = if Z ^ G. This means that, with 
probability one, given a sample of points on the unit square, we are able to 
decide whether or not they belong to the "green area" G or to the "red" 
one, R=[^,lY\G. 

It will be convenient to use the following notation. Let us denote by Px 
and Py the conditional distributions of the "green" and "red" observations, 
that is, the distributions of Z\{5 = 1} and Z\{5 = 0}. Observe that Px and 
Py are both uniform on G and R, respectively. Now, given z G [0, 1]*^ and 
e > 0, denote by Gz{£) and Rzie), respectively, the numbers of green and 
red sample observations belonging to the ball B{z,e), that is, 

n 

Gz{e) = Gn,z{£) = ^'^{Si=l,\\Zi-z\\<e}^ 

Rz{e) =Rn,z{£) - ^\5i=0,\\Zi-z\\<e}- 
i=l 

Clearly, Gz{£) has a binomial distribution with parameters n and pxi^, s) = 
P{\\Z — z\\ < e,6 = 1) = ^{G)Px{B{z,e)). Similarly, -^^(e) has a binomial 
distribution with parameters n and py{z,e) = (1 — fi{G))Py{B{z,e)). 

Let {sn} be a deterministic sequence of positive numbers which converges 
to zero as n tends to infinity. Denote T = dG. We propose the following 
estimator for the "dilated boundary," B{T,£n)'- 

(3) T„ = {z G [0, 1]"^ : Rzien) > 1 and G,(e„) > 1}. 

The simple intuitive idea behind r„ is to consider those points z in 
whose vicinity green and red points coexist. Of course, we could "robus- 
tify" this estimator by replacing the condition Rz{sn) > 1 and Gz{sn) > 1 
with Rz{£n) > Ti and Gz{sn) > ffi, for some fixed integer numbers ri > 1 and 
gi> 1. This modified estimator (which will not be considered here) would 
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be smoother and less noisy than the original version (3) at the expense of 
some efficiency loss. 

Finally, the definition (1) for suggests the following natural estimator 
for Lo = Lo{G): 

(4) L„ = f^. 

As usual, the nonparametric estimator (4) depends on a smoothing pa- 
rameter e„ which must be carefully chosen. In general, it should tend to 
zero slowly enough. The theoretical results of Section 3 will provide some 
additional insights in this respect. 

Note that the proposed method could be useful even in those cases where 
the image G is completely known (e.g., we could have a picture of G), 
but it is too complicated for directly measuring its boundary. Then the 
sample Zi, . . . , Z„ can be artificially generated provided that we are able to 
decide whether Zi belongs to G or not. So, in some sense, (4) can be seen 
as a "stochastic" algorithm to approximate Lq. This idea will be further 
developed in Section 4. 

3. Theoretical results. We analyze in this section the properties of the 
estimator L„ of the Minkowski content, Lq = Lq{G). 

3.1. Strong consistency. The almost sure (a.s.) convergence of L„ to Lq is 
established in Theorem 1 below. The "standardness" hypothesis (a) prevents 
the set G from having "too sharp" inlets and peaks along the boundary T. 
This condition has been previously used in set estimation (see, e.g., [7]). 

Theorem 1. Let us assume the following conditions. 

(a) The sets G and R are both standard in T , that is, there exists a 
constant C > such that, for small enough e, 

PxiB{t,e))>Gi2{B{t,e)) and PYiB{t,e)) > Gfi{B{t,e)) forallteT. 

(b) The sequence {Sn} satisfies 



Then 



and > oo. 

logn 



, l^{Tn) J 
Ln = — > Lq, a.S. 



2e„ 



Observe that the conditions imposed in (b) on the sequence of smooth- 
ing parameters are identical to those required for the strong consistency of 
kernel density estimators (see, e.g., [15]). However, as we will see below, the 
role of the smoothing parameter is quite different in both setups. 
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3.2. The function L[e). For a given value of n, the estimator L„ pro- 
vides, in fact, an estimation for L(e„) := /_f(i?(T, e„))/(2e„) which, in turn, 
is an approximation of the target value Lq. Thus, in order to assess the ac- 
curacy of the estimator L„, it is interesting to get more precise information 
on the difference \L{£) — Lq\. We next show that, under some smoothness 
assumptions, L[e) is differentiable at e = 0, which entails — Lq\ = 0{e). 
Indeed, note that 

B{T,e) = B{G,e)r\B{R,e), 

which leads to 

(5) ^x{B{T,e))=^Ji{B{G,e)) + ^Ji{B{R,e))-^Ji{B{[Q^Y,E)). 

Thus, the point is to have some idea about the structure of the "dilated 
measures" on the right-hand side of (5), when considered as functions of e. 
If G is assumed to be convex, the classical Steiner formula (see, e.g., [17], 
page 197) establishes that fi{B{G,e)) is a polynomial in e of degree at most 
d. Unfortunately, this result is not useful in our case, as the hypothesis of 
convexity for G could be too restrictive (e.g., in image analysis) and, in any 
case, it cannot be assumed simultaneously for both G and R = [0, 1]"^ \ G, 
except in trivial situations. However, we will be able to prove the required 
differentiability property for L{e) by combining some ideas of mathematical 
morphology (which we will use to impose the appropriate regularity condi- 
tions on G) with a (partial) generalization of Steiner's formula proved by 
Federer [10]. He imposes a positive reach condition closely related to the fol- 
lowing rolling condition often used in set estimation (see, e.g., Walther [20]): 
It is said that a ball can roll along T = dG outside G C M'^ if there exists 
ro > such that, for all r < ro and x G T, there exists a closed ball of radius 
r, Bx, such that BxCiG = {x}. 

A deep study of this outer rolling condition, including some interesting 
equivalences, is due to Walther [21], Theorem 1. This condition arises in 
mathematical morphology, a branch of the huge current theory of image 
analysis; see [18]. It has also appeared, under a slightly different form, in 
contexts not directly related to image analysis. In a similar vein, Federer [10] 
defines the reach of G as the largest (possibly oo) value tq such that if x G M'^ 
and the distance from x to G is smaller than tq, then G contains a unique 
point nearest to x. For our purposes of better understanding the nature of 
the function L{e), it will be particularly useful to employ a generalization 
of Steiner's formula obtained by Federer ([10], Theorem 5.6). This result 
establishes that, for any set G of positive reach ro, the function fj,(G,e) 
coincides locally [for e £ (0, ro)] with a polynomial of degree at most d whose 
independent term is ^J-{G). 

Thus, if we assume that both G and R satisfy the positive reach condition 
we may use Federer's theorem, together with (5), to conclude that iJ,{T,e) 
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coincides in the interval (0,ro) with P{£), where P denotes a polynomial of 
degree at most d with a null independent term. Note that (by the assumption 
made on the finiteness of the Minkowski content Lq) the coefficient of e in 
P(e) must necessarily coincide with 2Lq so that L{£) — Lq is a polynomial in e 
with a null independent term. In particular, we get that is differentiable 
at e = 0. 

3.3. Bounds for E{Ln): Li- consistency and convergence rates, variability 
and bias. It is not hard to show (see the proof of Statement 1 in the proof 
of Theorem 1) that, with probability one, C -B(T, e„) and, therefore, 

(6) Ln < L{£n) a.s. 

This means that L„ tends to underestimate Lq for those "regular" sets where 
the values of the function L{e) = fi{B{T,£))/{2e) are very close to L(0) := Lq 
for small values of e. In the bidimensional case the simplest example is given 
by the circle, for which L(e) = Lq. 

The following result provides a lower bound for E{Ln). 

Theorem 2. Assume that the standardness condition (a) in Theorem 1 
holds. Assume also that the function F{£) := iJ,{B(T,e)) is differentiable in 
a neighborhood of and the derivative F' is continuous at 0. Then 

(7) E{Ln) > L{en) - In, 

where In = /_b(T£„) ^^P(~-^^(^n ~ d{z,T)Y)dz, K being a positive con- 
stant and d{z, T) = inf{||z — t|| : t G T}. Also, 

(8) In = 0{{nei)-''\ 

The proof is given in Section 6. Note that, according to the discussion 
in Section 3.2, if we assume that both G and R fulfill the positive reach 
property, then the function F{£) = fi{B{T,£)) coincides in a neighborhood 
of with a polynomial of degree < d, so it is certainly differentiable at 
with a continuous derivative. 

The following corollary (the proof is in Section 6) provides a condition for 
the Li-consistency, as well as an upper bound for the Li-convergence rate 
of the estimator L„. 

Corollary 1. (a) Under the same conditions of Theorem 2, we have 

(9) E\Ln-LQ\<In + \L{£n)-LQ\. 

As a consequence, the standard conditions for consistency, £„ ^ and ne^ — > 
oo, are also sufficient here to ensure the Li- consistency E\Ln — Lq\ -^0. 

(b) By assuming further that G and R satisfy the positive reach condition 
mentioned in Section 3.2, we have that the optimal order for the bound (9) 
is 0{n~^/'^'^) , which is attained for e„ = n~^^'^'^. 
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Not surprisingly, the bound 0(n~^/^'^) corresponds to a rather slow con- 
vergence rate. We do not believe that the exact rate can improve much on 
this bound. Recall that the typical rates for the much easier problem of con- 
sistently estimating the boundary dG, with respect to the Hausdorff metric, 
are of type 0( (log n/n)V'i) [7] even under the assumption of convexity on G 
[9]. Anyway, in some applications (see Section 4) the estimator L„ is based 
on artificial (Monte Carlo) samples and the slow convergence rate is not so 
crucial a problem, as the sample size can, in principle, be increased as much 
as necessary. 

As a further consequence of (6)-(8) we get [under the regularity assump- 
tions imposed in Corollary 1(b)] the following bounds for the Li-variability 
and the bias: 

ElK - E{Ln)\ < E\Ln - L{en)\ + \L{en) - E{Ln)\ 

(10) 

<2/„ = 0((n4)-V'^), 
Lo - E{Ln) = {L{en) - E{Ln)) + (Lo - L(e„)) 

(11) 

= 0{[nei)-^/'')+0{en). 

Thus, the assumption ne^ — > oo guarantees the convergence to zero of the 
variability around the mean, E\Ln — E{Ln)\- Note that this condition is 
identical to that imposed in the classical (L2 or Li) theory of density es- 
timation in order to control the variability term. However, expression (11) 
shows that ne^ — > co is also useful to make the bias term tend to zero. This 
is in sharp contrast with the typical situation in nonparametric functional 
estimation where e„ usually suffices to kill the bias. The situation here 
is a bit different: We do need the condition e„ — > 0, but if the convergence 
is too fast, the estimator L„ will be biased, underestimating the value of 
Lq. Thus, the bias is also controlled by the condition ne^ — > 00 which is 
used in the proof of Theorem 1 to prevent the boundary estimator from 
having spurious "holes" [that would lead to underestimation of n{B{T,en)) 

by f^{Tn)]- 

Let us also note that it is interesting to assess the magnitude of the "effec- 
tive bias" E{Ln) — L{en)- This is particularly useful in practical applications 
(see Section 4 below) when one is willing to consider L(e„) as a reasonable 
approximation for Lq, thus, accepting a systematic bias which hopefully 
would affect in a similar way all the images under study. In these cases the 
focus is on the differences £'(L„) — -L(e„), analyzed above. 

4. Applications to image analysis. Let us first emphasize that our ap- 
proach is basically aimed at those cases where only partial (random) in- 
formation is available, rather than dealing with completely known images. 
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These usually come in a digitized form, but the digitization process is itself 
an approximation involving nontrivial problems, largely beyond the scope of 
this paper. The classical book by Serra [18], pages 211-224, provides some 
deep insights in this regard. Anyway, if we have a "known" image, either in a 
digitized version or in a "exact" format (e.g., the area inside a known closed 
curve: see Section 5), it is tempting to check the behavior of the estimator 
(4), based on Monte Carlo random samples, when used to approximate the 
boundary length. 

In this section we develop this idea and apply it to a medical example. 

4.1. The contour index. A case study in cardiology. The irregularity 
in the border of a tumor or an infarcted area is an auxiliary diagnostic 
criterion for malignancy assessment. The so-called "contour index" (CI) 
provides a size-independent quantitative measurement of boundary rough- 
ness. It is defined (for the case of bidimensional images) as the quotient 
boundary length /^J area. Its minimal value (2^/??) is attained by the circle. 
The CI has been used in oncology (see, e.g., [4]) and cardiology [11], but the 
interpretation of this index in the two scientific fields is somewhat different. 
A high value of the CI in a tumor usually suggests a high dissemination ca- 
pacity of the injured area. On the contrary, in cardiology the prognosis of an 
infarction tends to be worse when the damage is highly concentrated with a 
"regular" border (which will provide a small CI) rather than disseminated 
in many small irregular patches. 

In order to assess the applicability of our estimation method to real ex- 
amples, we have analyzed an image (Figure 1, left) of the infarcted heart of 
a pig. It corresponds to one side of a transversal section of the heart which 
has been exposed to a histochemical reaction that dyes the living cells. Thus, 
the infarcted cells fail to catch the color, appearing as a white-grayish area 
in the upper-right side of the image. This area should not be confounded 
with the endocardial endothelium (which covers the inner part of the heart). 
It appears in deep white at the centre of the image. In fact, most of this 
endothelium white area is not placed in the same plane as the considered 
transversal section. The jpg file of the original color image (Figure 1, top left) 
has been digitized in an array of 495 x 710 pixels. The information stored 
in every pixel consists of a vector {xi,X2,x^) indicating the level (on the 
scale 0-255) of primary colors (red, green and blue) at that point. So, if we 
consider the position coordinates, every pixel is, in fact, a five-dimensional 
observation. 

4.2. A stochastic algorithm for calculating the CI. In the example con- 
sidered the goal is to identify the infarcted area and give an approximate 
value for the CI. Our estimation method has been used with the following 
steps: 
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1. Image identification and cleaning. The image of interest (Figure 1, top 
left) must be treated in order to clearly decide the precise shape of the 
infarcted area (a bit blurred in the original picture) whose boundary 
is to be measured. The problem is to decide whether or not a pixel in 
the picture corresponds to the infarcted area. We have done this using 
the classical Fisher linear discriminant function. To put this in more 
precise terms, two large samples of pixels have been taken in the infarcted 
and in the noninfarcted area. Then the classical linear discrimination 
method was applied to classify the remaining points. The classification 
error was negligible except for the points in the white endothelial area 
at the centre of the original image (that tended to be confounded with 
the infarcted cells), where the error rate was appreciable. The result of 
this automatic discrimination-based treatment is shown in Figure 1 (top 
right) where the infarcted area has been colored in black but there are also 
some patches of obviously misclassified endothelial tissue. Thus, a final 
"manual cleaning" was made to remove these patches. The result is given 
in Figure 1 (bottom) . This was the final image (600 x 600 pixels) used for 




Fig. 1. An infarcted heart (top left). The estimated infarct area (top right). The 
cleaned'^ infarct area (bottom). 



10 



A. CUEVAS, R. FRAIMAN AND A. RODRIGUEZ-CASAL 



the quantitative analysis described in Section 4.3. Let us note that the 
classification algorithm has been based only on the "color" coordinates 
of every point. We have disregarded the information provided by the 
point positions because the use of a linear discrimination method looked 
particularly unsuitable when these variables are involved. 

By the way, this application of discriminant methods in image cleaning 
shows the interest in studying discriminant theory from the point of view 
of image analysis; this would amount to incorporating classification crite- 
ria based on shape preservation (connectedness, smoothness), in addition 
to the usual notions relying on misclassification probabilities. 

2. Monte Carlo sampling and classification. A large artificial uniform sample 
Zi,. . . ,Zn is drawn in [0, 1]"^. The classification variable 6i is obtained for 
each Zi:5i = l when Zi belongs to the infarcted area, 5i = otherwise. 

3. Estimation. As indicated in Section 3, the optimal order (under some 
shape restrictions) for e„, is n"^/^*^. The estimator (4) (and the corre- 
sponding boundary estimator T„) is obtained for several values of the 
smoothing parameter e„. The idea is to check the sensitivity of the esti- 
mation process with respect to changes in the value of £«• Alternatively, 
some procedure (cross-validation, bootstrap-based choice) for the opti- 
mal selection of the smoothing parameter could be used. However, in real 
applications (see Section 4.3 below) an optimal choice would be not so 
crucial when the procedure is used to establish comparisons between sev- 
eral images. In the case of a bi-color digitized image the calculation of 
yu(r„) (and that of the area that appears in the denominator of the CI) 
is made by a simple count of the corresponding activated (black) pixels. 

Note that, in practice, the first stage could be omitted as, strictly speaking, 
only the randomly selected points need to be classified. An interesting open 
problem in this regard would be to consider a more realistic model incorpo- 
rating the classification error in the "red" or "green" areas, G and R (see 
Section 2). 

4.3. Results. In the example of Figure 1 we have considered two sample 
sizes, n = 50,000 and n = 100,000. The results are summarized in Table 1. 

The choices of the values of type Cfcn where the constants Ck, 

for k = 1,2, 3, are taken in order to consider small perturbations around the 
reference value n~^/^. In the case n = 100,000 we have n-V4 = 0.0562, so 
we decided to take Ci, C2 and C3 in order to get "exact" values (0.05, 0.02 
and 0.01) for the smoothing parameter CkSn- This entails that Ci = 0.8897, 
C2 = 0.3559 and C3 = 0.1779 and we have kept these constants for the case 
n = 50,000. 

The output in Table 1 indicates that the CI value is about 5.4. Clearly, 
the values (3.61, 3.96) obtained for the largest choices of e.„ correspond to 
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Fig. 2. Oversmoothed boundary estimation of the infarct area in Figure 1. 

oversmoothed estimations; recall that the CI for a circle is 3.5449. This is 
apparent from the image of Figure 2, which shows the estimate T„ of the 
infarct boundary for the case n = 50,000, e„ = 0.05. 

A remarkable fact in the results is their small variability. This means 
that, in practice, we can use a given (not necessarily optimal) choice of 
to perform comparisons between different images. Maybe the true CI's are 
estimated with an appreciable bias, but this is, by far, the main source of 
error. Thus, the estimated CI's would allow us to get an assessment of the 
relative importance of the different cases from the point of view of infarct 
geometry, and the value of e„ corresponds, in some sense, to the resolution 
level employed in the procedure. 

It is also worthwhile to observe that due to the presence of in both 
the numerator and denominator of (4), the variability of this estimator is 
not a monotone function of £„. This is in contrast to the usual behavior of 
nonparametric estimators (e.g., kernel density estimators). 

The estimation CI ~ 5.4 suggests a rather negative diagnostic for the 
infarct shown in Figure 1. For example, in [11] the "infarct geometry" of a 
control group of eight infarcted pigs was studied and compared with that 
of another treatment group of eight individuals, also suffering a miochardial 
infarct but receiving a drug called 2,3-butanedione monoxime. The values 
found for the CI in the control and the treatment group are 7.7 it 0.2 and 
9.4 lb 0.7, respectively, which suggests a much better prognosis than that in 
our example. 

In the case of the digital images, the choice of the smoothing parameter 
£n is obviously limited by the pixel size. In our case, each side of the square 



Table 1 

Average values and standard deviations along 100 replications of the CI estimation for 

the infarct area in Figure 1 



Sample size 




n = 50,000 






n = 100,000 




En 


0.0119 


0.0238 


0.0595 


0.01 


0.02 


0.05 


Mean 


5.2080 


5.1265 


3.6104 


5.7257 


5.53 


3.96 


Standard deviation 


0.0042 


0.00342 


0.0129 


0.0294 


0.0213 


0.0099 
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[0, 1] X [0, 1] was divided into 600 square pixels so that the minimum effective 
choice of e„ would be 1/600 = 0.0017. 

On the other hand, the large sample sizes (n = 50,000, 100,000) used in 
the study suggest the idea of using all the available pixels (360,000 in this 
example). The practical implications of such an "exhaustive method" are 
analyzed in some detail below [see paragraphs (g) and (h) in Section 5.2], 
in connection with the simulation example considered there, where the true 
value of the boundary length is exactly known. 

The relative simplicity of the proposed method suggests the possibility 
of generalizations to multicolor higher-dimensional images; these could ap- 
pear in the context of magnetic resonance explorations where very precise 
determinations are obtained for different magnitudes as the pH or the ATP 
(which measures the energy cell status). 

5. Simulations. The estimator (4) is designed for cases where only in- 
complete information (given by "natural" sampling points on both sides of 
the border) is available. In this sense, the proposed method can be seen as 
a refined version of the nonparametric method for estimating boundaries 
discussed in [7]. The requirement of two samples (inside and outside the set) 
can be formalized with different models, but seems to be unavoidable in or- 
der to estimate the surface measure, unless we are willing to impose strong 
assumptions on the shape of G. On the other hand, the estimator (4) can be 
based on Monte Carlo (artificial) samples, to be used in contexts not directly 
related to image analysis, just as a stochastic device for approximating the 
length of a closed curve or the surface area of a body in M^. 

As an example, we have considered the so-called Tschirnhausen Cubic 
(also known as Catalan's trisectrix and I'Hospital's cubic), a plane curve 
whose polar equations are 

r = asec^(6l/3), for G (0, vr), 

r = asec^ ((271-6*) /3), for 6* G (vr, 27r). 

The reason for choosing this curve is the existence of closed simple ex- 
pressions for both the length (Lq = 12o\/3) of the loop and the area inside 
{A = 720^^3/5). We have used our estimation method in order to approx- 
imate Lq and A in the case a = 1 (see Figure 3), so the target values are 
Lq = 20.7846 and A = 24.9415. The random samples, with sizes n = 30,000 
and n = 10,000, are drawn in the square [—9,2] x [—5.5,5.5], which fully 
includes the Tschirnhausen loop. 

Before discussing the simulation experiment and output, we should con- 
sider a practical issue regarding the effective calculation of the estimator. 
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Fig. 3. The Tschirnhausen Cubic. 



5.1. Monte Carlo approximation of the estimator. The estimator (4) 
(and the corresponding boundary estimator T„) must be computed for every 
choice of considered. An important practical problem is that the direct 
computation of ^(7n) is not an easy task. However, it can be approximated 
easily, with an arbitrary precision level, by using again the Monte Carlo 
method. Let be a random sample (independent of 

from the uniform distribution on the unit square [0, l]'^. Since, with proba- 
bility one, Tn C [0, l]'^ for n large enough, we have that iJ,{Tn) = P{Z^ G Tn) 
and therefore, for B large, 



l-l-BiTn) 



B 



should be a good approximation of ^{Tn). Note that it is very easy to check 
when a point Z* belongs to r„. This Monte Carlo method provides an 
approximate evaluation for L„, 



T* 



2en 



An interesting question in order to apply the proposed method is how to 
pick B (as a function of n) to ensure that L* ^ is a consistent estimator of 
Lq. The next theorem gives an answer to this question. The proof is given 
in Section 6. 



Theorem 3. Besides the hypothesis of Theorem 1, let us assume that 

(12) -^^oo. 

logn 

Then L* ^ ^ Lq, a.s. 
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5.2. Simulation output. The results of our simulation study are summa- 
rized in Table 2. The estimator L„ has been evaluated for 500 samples of 
sizes n = 30,000 and n = 10,000. The resampling parameter B, used in the 
Monte Carlo approximations of /i(T„), was B = 1500 in all cases considered. 
The output in Tables 2 (for n = 30,000) and 3 (for n = 10,000) provides the 
average, standard deviation and median of L„ computed from the 500 repli- 
cations, for different values of e„. The output is obtained using the same 
simulated samples for each value of e„. Thus the usual Monte Carlo area 
estimate, which does not depend on a smoothing parameter, is the same in 
all cases. The average, standard deviation and median obtained for this area 
estimator are respectively 24.9196, 0.4458 and 24.9125 for n = 30,000 and 
24.9485, 0.7842 and 24.9889 for n= 10,000. 

Some direct conclusions can be drawn from these results: 

(a) The true value Lq = 20.7846 is systematically underestimated with a 
relative error about 4.7% (in the case n = 30,000) and 8.1% (for n = 10,000). 
The gain obtained by increasing the sample size is mostly apparent in the 



Table 2 

Average, standard deviation and median of Ln computed over 500 replications with 

n = 30,000 



en 


0.76 


0.78 


0.80 


0.82 


0.84 


0.86 


0.88 


Average 


19.7301 


19.7416 


19.7621 


19.7644 


19.7918 


19.7918 


19.7859 


Std. deviation 


1.3940 


1.3935 


1.3793 


1.3448 


1.3470 


1.3200 


1.3072 


Median 


19.7548 


19.7920 


19.8274 


19.7576 


19.8930 


19.8249 


19.8081 




0.9 


0.92 


0.94 


0.96 


0.98 


1.0 


1.2 


Average 


19.7901 


19.7949 


19.8109 


19.8208 


19.8290 


19.8230 


19.8237 


Std. deviation 


1.2952 


1.2917 


1.2636 


1.2331 


1.2159 


1.2031 


1.0666 


Median 


19.8863 


19.9150 


19.8522 


19.8804 


19.8209 


19.8804 


19.8627 



Table 3 

Average, standard deviation and median of Ln computed over 500 replications with 

n= 10,000 





0.76 


0.78 


0.80 


0.82 


0.84 


0.86 


0.88 


Average 


19.0026 


18.8594 


18.9512 


18.9370 


18.9507 


19.0806 


19.1083 


Std. deviation 


1.3908 


1.3586 


1.3260 


1.2627 


1.3075 


1.3398 


1.2467 


Median 


18.9736 


18.9221 


18.9791 


18.9300 


18.9842 


19.0359 


19.0852 




0.9 


0.92 


0.94 


0.96 


0.98 


1.0 


1.2 


Average 


19.0492 


19.1409 


19.1408 


19.2057 


19.2134 


19.2384 


19.3679 


Std. deviation 


1.2956 


1.1936 


1.1599 


1.2460 


1.2157 


1.1777 


1.0394 


Median 


19.0381 


19.1774 


19.1303 


19.1735 


19.2150 


19.2548 


19.3679 
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bias. The average (over all values of e„) of the average output is 19.0890 for 
n = 10,000 and 19.7900 for n = 30,000. 

(b) The simulation output shows a considerable stability with respect to 
the values of the smoothing parameter e„. This stability remains even for 
other smaller values of e„ (not included in the tables) that we have checked. 
For example, whereas the average of the average values of L„ obtained from 
the 14 choices of e„ included in Table 2 is 19.79, the corresponding average 
for the other five equispaced values of En, ranging from 0.64 to 0.72, is 
19.5985. 

(c) The sampling distributions are almost symmetric, with the median 
very close to the mean in all cases. 

(d) There is a slight, but consistent, decline of the variability around the 
mean as the smoothing parameter increases. 

(e) As could be predicted, the variability results tend to improve, at the 
expense of some additional computational burden, by increasing the value 
of the resampling parameter B. For example, the output for the average, 
standard deviation and median of L„ with e„ = 0.92, n = 30,000 and B = 
2000 is 19.8341, 1.0790 and 19.8458, respectively. For n = 10,000, with the 
same value of e„, the corresponding output for B = 3000 is 19.2229, 0.8490 
and 19.1774. These results account for the small changes in the variability of 
Ln from n = 10,000 to n = 30,000. They suggest that, for these sample size 
magnitudes, most variability is due to the Monte Carlo approximation stage 
of the numerator /i(T„) in (4), controlled by the parameter B. The value 
B = 1500, used in the simulations of Tables 2 and 3, should be considered as 
a first computationally affordable choice, suitable for this preliminary study. 

(f) The plots of the density estimators obtained from the values of L„ 
suggest that the sampling distribution is, for all the considered choices of 
En, very close to normality. As a consequence, an interesting open problem 
would be to establish the asymptotic normality of L„. However, the proof 
seems far from trivial in view of the special structure of L^. 

(g) In this example we have implemented our method in a case where an 
exact equation for the border is known. So no digitization process is involved. 
In practice, most real black-and-white images come in a digitized version. 
In mathematical terms this amounts to replacing the original image G by a 
finite union of square pixels with sides of fixed length h, parallel to the 
coordinate axes. In such situations one could think of exactly measuring the 
border length of the "digital boundary" dG^. This is just the number 
of pixel sides separating regions of different colors. This is computationally 
feasible and avoids the use of any smoothing parameter. However, it is not 
difficult to see that this direct exhaustive procedure will fail, as cannot 
converge to L when h tends to 0. For example, if dG includes a segment A in- 
side the diagonal x = y, the length of A will be overestimated by a factor 
when G is approximated by G^. This is empirically confirmed in our case: 
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If we replace the region G inside the Tschirnhausen Cubic by a digitized 
version, obtained by dividing the "frame square" [—9,2] x [—5.5,5.5] into 
300x300 pixels, the direct exhaustive method gives an estimation = 25.97 
for the true value Lq = 20.78. Our method, with n = 10,000, provides much 
more acceptable estimation around 19.7 (see Table 3). The use of a more 
precise digitization does not improve things (in fact, it reveals a lack of con- 
sistency in the exhaustive procedure). For example, a 600 x 600 digitization 
leads to ~ 26.5, and with 1024x1024 pixels we get L'^ ~ 28.1875. 

The exhaustive method could also been implemented in an indirect ver- 
sion, based on measuring areas. The boundary length could be estimated by 
Area(G''*)/2/i, where G^^ denotes the union of all "boundary pixels" in G . 
This also fails: estimation for the 300 x 300 and 600 x 600 digitizations gives 
respectively 19.36 and 19.32. Note that, in fact, this procedure uses implic- 
itly a smoothing parameter (the pixel side length h). The failure should be 
interpreted as a phenomenon of undersmoothing; see the comment about 
the bias after (10) and (11). 

(h) The use of all the available pixels is still a possibility, although, in 
view of the previous comment it should always be done with an appropri- 
ate amount of smoothing, along the lines indicated above. Although this 
exhaustive procedure "with smoothing" is feasible in many cases, it is not 
advisable in general, due to its lack of robustness against the "noise" (in 
the form of disperse error pixels not belonging to the image). By contrast, 
the method based on random samples will automatically ignore (with high 
probability) the possible disperse noise, at the expense of higher variabil- 
ity. We have checked this by randomly adding four patches of noise, in the 
form of circular clusters (with radii 0.25) of black pixels, within the square 
[—9,2] X [—5.5,5.5], where the loop of the Tschirnhausen Cubic is included. 
In the worst case (when the four noise patches fall on the white background, 
outside the black image), the amount of noise added to the image represents 
less than 1% of the total number of pixels. The presence of the noise turned 
out to have a devastating effect in the exhaustive method with smoothing: 
The average length obtained with this method for 500 of such noisy images 
is 24.92 (standard deviation 1.63), whereas the random method applied with 
a sample size n = 5000 and e„ = 0.94 gave an average of 21.07 (standard de- 
viation 0.9992). Curiously enough, the results for the latter method (recall 
that the true value is 20.78) are even better than those obtained in the case 
with no noise since the noise tends to increase the boundary length, thus 
partially correcting the inherent underestimation bias. 

6. Proofs. 

Proof of Theorem 1. The result is a consequence of the following 
two claims. 
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Statement 1. With probability one, Tn C B{T,en)- 

Statement 2. For any < a < 1, we have eventually, with probability 
one, 

B{T,e'JcTn, 

where e'^ = aSn , < a < 1 . 

Proof of Statement 1. For any z G T„, we have that (with proba- 
bihty one) B{z,en) meets G and its complementary R. Therefore, B{z,en) 
meets the boundary of G, T, which means that z belongs to B{T,en)- This 
concludes the proof of Statement 1. □ 

Proof of Statement 2. By the Borel-Cantehi lemma, it is sufficient 
to show that 



Y,P{BiT,e'J^Tn)<oo. 

n=l 

However, 

P{B{T, e'^) ^ Tn) < P{3z G B{T, e'J : = 0) 

(13) 

+ P{3zeB{T,e'J:R,{en) = 0). 

Now, we try to find an upper bound for the first probability on the right- 
hand side. The other probability can be bounded by a similar argument. 

For any z G B{T,Sn), there is an t G T for which B{t, fin) C B{z, En)-, where 
/3n = (1 — «)£„. Therefore, 

P{3z G B{T, e'n) : (?,(£„) = 0) < P{3t G T : Gt{Pn) = 0). 

Let T{Pn) be a set [with cardinality D{Pn)] of ball centres corresponding 
to a minimal covering of T by balls of radius /3n/2. So we consider a class 
{B{s, (3n/2) : s G T{l3n) C T} such that 



Tc [j B 

sGT(/3„) 



^' 2 



Since {3t GT:Gt{Pn) = 0} C {3s G T{(3n) : G,(/3„/2) = 0}, we have 
P{3t G T : GtiPn) = 0)<p(3se r(/3„,) : (^^) = 
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sGT(/3„) 

where in the last inequahty we have used the fact that 1 — x < for 
< X < 1. The right-hand side of the above inequahty can easily be bounded 
since, from the standardness hypothesis, for n large enough, 

where cJd = fi{B{0, 1)) and Ki is a constant which depends on the dimension 
d, a, and C. Therefore, 

P(3z G B{T, e'J : = 0) < D{pn) exp{-i^i4}. 

Now, in order to bound the function D{e), recall that it represents the 
cardinality of a minimal covering C(e/2) of T by balls of radii e/2. This 
entails (e.g. [14], page 78) that there exists a family of -D(e) disjoint balls 
with radii e/4 and centres at points of T. Then the sum of their measures 
must be smaller than fi(B{T,e/4:)). Hence, 

D{e){e/Afuja<fi{B{T,e/4)). 

Since L{e) — > Lq, we get for e small enough, D{e) < Ae^~'^ for some constant 
A. Therefore, 

P{3z e B{T,e'J:G,{en) = 0) < iTae^-'^expC-Kme^), 

where K2 = (1 — a)^~'^A. The condition ne^/logn 00 ensures the con- 
vergence of the series £n~'^6xp(— i^iJie^). The other probability in 
(13) can be bounded in a similar way. Note that the obvious inequality 
D{£) < Ae~'^ would also suffice for the purpose of convergence, but the above 
simple argument provides a sharper bound for the probabilities. This con- 
cludes the proof of Statement 2. □ 

Now the proof of Theorem 1 is a straightforward consequence of State- 
ments 1 and 2. Indeed, we have that, with probability one, 

«Lo = lim ^^^^^'^^^^ < hm inf L„ < lim sup L„ < lim ^^^^^'^"^^ = Lq. 
" 2en " n " 2e„ 

This holds for any a G (0, 1) and therefore, the conclusion of the theorem 
follows. □ 
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Proof of Theorem 2. The expected value of L„ can be written as 

since, with probabihty one, r„ C B{T,en)- It is clear that 

(14) P{z i Tn) < P{G,{en) = 0) + P{R,{en) = 0). 

Remember that Gz{£n) has a binomial distribution with parameters n and 
px{z,en)- Therefore, 

P{Gz{en) = 0) = {l-px{z,en)T < exp{-npx{z,en)}. 

Let Ptz gT he the projection of z onto T. Since, for any z G B(T,en), 

B{PTZ,en-d{z,T))cB{z,en), 

using condition (a) of Theorem 1, we have that, for e„ small enough, 

Px{B{z,en))>CiOd{en-d{z,T)f. 

Hence, 

P{G, = 0) < exp{-Kin(e„ - d{z, T)f}, 

where Ki is a positive constant which depends only on ^(G), G and the 
dimension d. Similarly, we have that P{Rz = 0) < exp{— ir2n(e„ — d{z, T))"^}, 
for a positive constant K2 which depends only on n{R), G and d. Using these 
bounds and (14), we get 

P{z e r„) > 1 - 2exp{-i^n(e„ - d{z,T)f}, 
where K = mm{Ki,K2). Thus, we have that 

EiLn) = ^ I Pize Tn) dz 

[ (l-2exp{-Kn(e„-d(z,r))'^})dz 
= L{en) - — I exp{-Kn(e„, - d{z, T))'^} dz = L(e„) - I„, 

en JB{T,e„) 

with 

In = — [ gn{d{z,T))dz, 

En Jb{T,£„) 



20 



A. CUEVAS, R. FRAIMAN AND A. RODRIGUEZ-CASAL 



where gn{w) = exp{— i^n(en — w^}. By the change of variable formula, we 
have that 

(15) In = — r gnHF{dw), 

En "'0 

where F{w) = fi{{z : d{z,T) < w}) = iJ,{B(T,w)). By the assumption made 
on the continuous differentiability of at and the existence and finiteness 
of the Minkowski content, we have -F'(O) = 2Lq so that, for w small enough, 
^ 3Lo. Finally, for n large enough. 



£n Jo £n Jo d[Kn)^i'^ 

3Ln f°° , , , A 



dK^/d{einy/<^ Jo ' {eiriY''^' 

where in the first inequality we have applied in (15) the change of variable 
t = En — w and then (for the first equality) u = Knf^. □ 

Proof of Corollary 1. The bound (9) for the Li-error follows as a 
direct consequence of the bounds (6)-(8) together with the triangle inequal- 
ity. Now, the conclusion (a) follows from (8) and the definition of Lq. 

To show (b), note that the optimal convergence order for the bound (9) 
is obtained by making equal the convergence orders of both terms on the 
right-hand side. Under the smoothness conditions mentioned in Section 3.2, 
we have \L{en) — Lq\ = 0(e„) (see [10], Theorem 5.6). Thus, from (8), the 
optimal order for the bound (9) is 0{n~^^'^^), which is attained for e„ = 



Proof of Theorem 3. Clearly, it is enough to show that L* ^ - L„ ^ 
0, a.s. This can be proved showing that, for any p> 0, 

(16) Y.P(\KB-Ln\>p)<00. 

n 

This is not hard to do because, given Zi, . . . , Z„, -L* ^ has (essentially) a bi- 
nomial distribution with mean L„, and, therefore, we can use a concentration 
inequality to control the size of its tail. Indeed, 

Pi\Ll^B-Ln\>p) 

= E{P{\Ll^B-Ln\>p\Zi,...,Zn)) 

= E{P{\flBiTn) - piTn)\ > 2pen\Zi, Z^)) 
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where in the last step we have used Bernstein's inequahty. It is not hard 
to bound this last quantity because goes to zero (with probability 

one) as fast as £n when n tends to infinity. To see this, note that in Theo- 
rem 1 we proved that (with probability one) T„ C B{T,en) and, therefore, 
IJ-iTn) < l^{B{T,en))- Since L(e„) Lq, we have that, for n large enough, 
fi{B{T,en)) < iLQEn- So, for n large enough, 

2fi{Tn){l-fi{Tn)) + {4/3)pen 

Ap'elB 



E[ 2exp 



< El 2exp, , , , 

- * ^' 8Loe„ + (4/3)pe„ 

= 2exp{-Kp^Lo£nB}, 

where Kp,Lo is ^ (positive) constant. Obviously, (12) ensures that, for any 
p>0, 

^expi-Kp^L^EnB} < oo, 

n 

and, therefore, (16) holds. This concludes the proof of the theorem. □ 
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